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Abstract 

Regularization is widely used in statistics and machine learning to prevent overfitting and gear solution 
towards prior information. In general, a regularized estimation problem minimizes the sum of a loss 
function and a penalty term. The penalty term is usually weighted by a tuning parameter and encourages 
certain constraints on the parameters to be estimated. Particular choices of constraints lead to the 
popular lasso, fused-lasso, and other generalized h penalized regression methods. Although there has 
been a lot of research in this area, developing efficient optimization methods for many nonseparable 
penalties remains a challenge. In this article we propose an exact path solver based on ordinary differential 
equations (EPSODE) that works for any convex loss function and can deal with generalized h penalties 
as well as more complicated regularization such as inequality constraints encountered in shape-restricted 
regressions and nonparametric density estimation. In the path following process, the solution path hits, 
exits, and slides along the various constraints and vividly illustrates the tradeoffs between goodness of fit 
and model parsimony. In practice, the EPSODE can be coupled with AIC, BIC, C v or cross-validation 
to select an optimal tuning parameter. Our applications to generalized l\ regularized generalized linear 
models, shape-restricted regressions, Gaussian graphical models, and nonparametric density estimation 
showcase the potential of the EPSODE algorithm. 

Keywords: Gaussian graphical model, generalized linear model, lasso, log-concave density estimation, 
ordinary differential equations, quasi-likelihoods, regularization, shape restricted regression, solution path 



1 Introduction 



Regu l arization is a 



1996 



Chen et al 



frequ ently used framework in statistics. Examples include the lasso regression (jTibshirani 



2001) and the l\ penalized generalized linear models (GLMs) among many others. For 
both the lasso and the l\ penalized GL Ms, efficient solution path algorithms have been proposed to ea se the 



tuning of the regularization parameter (jOsborne et al 



2000; 



Efron et al 



2004; 



Park and Hastie 



2007). Yet 



extension to other more general settings is nontrivial and has been an active research area. 
In this article, we consider a general regularization framework 



min f(J3) + p\\V/3 -d\\ 1+ p\\Wf3 - e|| + , 



(1) 
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for which we propose an efficient exact path solver based on ordinary differential equations (EPSODE). Here 
/ : MP i— >• R can be any convex, smooth loss function of (3 e W , where p > is the dimensionality of the 
parameters. For any vector v — (vi), \\v\\i = X)j \ v i\ denotes its l\ norm and ||u||+ = ^jmax{ui,0} is the 
sum of positive parts of its components. The EPSODE provides the exact solution path to (HJ as the tuning 
parameter p varies. 

1.1 Generality of (Op) 

The generality of (fTJ) is two-fold. First / can by any convex loss function. For example, it can be the negative 
log-likelihood function of GLMs, negative quasi-likelihood, the exponential loss function of the AdaBoost 



(Friedman et al 



2000), or many other frequently used loss functions in statistics and machine learning. 
Second we allow V and W to be any regularization matrices of p columns. This leads to broad applications. 
In particular, the first regularization term p||V/3 — e£|ji encourages equality constraints among parameters 
(3. When p is large enough, the minimizer (3(p) of (p} satisfies V (3(p ) = d. For instance, when V is th e 



identity matrix and d = 0, it recovers the well-known lasso regression 
which encourages sparsity of the estimates. When 

-1 1 



Tibshirani 



1996 



Chen et al 



2001), 



V = 



-1 1 



20051 ). which leads to smoothness 



and d = 0, it corresponds to the fused-lasso penalty (jTibshirani et al. 
among neighboring regression coefficients. As we will show later, more complicated equality constraints 
can be incorporated with properly designed V and d. On the other hand, the second regularization term 
pj|W/3 — e|| + enforces regularization by inequality relations among regression coefficients. For large enough 
p, the minimizer (3(p) satisfies W(3(p) < e. For instance, setting W as the negative identity matrix 
and e = encourages non negativity of the estimates, as required in nqnnegative least squares problem s 



( Lawson and Hanson 



19871 ). In the isotonic regression (|Robertson et al 



1988 



Silvapulle and Sen 



2005), 



the estimates have to be nondecreasing. This can be achieved by the regularization matrix 

W = 

V i -i 

and e = 0. More complicated constraints that occur in shape- restricted regression and nonparametric 
regressions also can be incorporated as we demonstrate in later examples. 

In certain applications, both equality and inequality regularizations are required. In that case, as shown 
in Section[51 at a large but finite p, the minimizer (3(p) coincides with the solution to the following constrained 
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optimization problem 

min f(fi) (2) 
s.t. V/3 = d and W/3 < e. 

Consequently EPSODE solves the linearly constrained estimation problem ([2]) as a by-product. In this case, 
path following commences from the unconstrained solution argmin/(/3) and ends at the constrained solution 
to ©. 

1.2 Previous Work 

Several path algorithms have bee n devised for special c ases of the general regularization problem ([T]). For 



example, the hom otopy method (jOsborne et al 



(Efron et al 



20001 ) and the least angle regression (LARS) procedure 



2004) handle lasso penalized least squares problem. The solution path g e nerat ed is piecewise 



linear and illustrates the tradeoffs between goodness of fit and sparsity. 



Rosset and Zhul (|2007T) give sufficient 



conditions for a solution pat h to be piecewise linear and ex pand its applications to a wider range of loss and 



penalty functions. Recently 



Tibshirani and Tavlorl (|201ll ) devise a dual path algorithm for generalized l\ 



penalized least squa res problems, whi c h is p roblem (|TJ) with / quadratic but without the second inequality 



regularization term. 



Zhou and Langd ([20111 ) consider ([T]) in full generality for quadratic /. All these work 



concerns regularized linear regression for which the solution path is piecewise linear. Several attempts have 
been made to pa th following for regularized GLMs for which the solution path is no longer piecewise li near. 



Park and Hastid ([20071 ) propose a predictor-corrector approach to approximate the lasso path for GLMs. 



Wu 



201 1[ ) presents an ordinary differen tial equation-based path algorithm which delivers the exact path for lasso 



penalized GLMs. iFriedmanl ([20081 ) derives an approximate path algorithm for any convex loss regularized 
by a separable, but not necessarily convex penalty. Here a penalty function is called separable if its Hessian 
matrix is diagonal. The separability restriction on the penalty term excludes many important problems 
encountered in real applications. 

Our proposed approach generalizes previous work in several aspects. First, it works for any convex 
loss (or criterion) function. Second, it allows for any type of regularization in terms of linear functions 
of parameters, equality or inequality. Equality constrained regularizations include lasso, fused-lasso and 
generalized l\ penalty for example. Inequality constrained regularizations are required in shape-restricted 
regression and nonparametric log-concave density estimation. Last but not least, it is an exact path algorithm. 

1.3 A Motivating Example 



For illustration, we consider a merger and acquisition (M&A) data set studied in ([Fan et al 



20111 ). This data 



set constitutes n = 1,371 US companies with a binary response variable indicating whether the company 
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becomes a leveraged buyout (LBO) target (yi = 1) or not (yi — 0). Seven covariates (1. cash flow, 2. cash, 
3. long term investment, 4. market to book ratio, 5. log market equity, 6. tax, 7. return on S&P 500 
index) are recorded for each company. There have been intensive studies on the effects of these factors on 
the probability of a company being a target for strategic mergers. Exploratory analysis using linear logistic 
regression shows no significance in most covariates. 

To explore the poss i bly n onlinear effects of these quantitative covariates, the varying-coefficient model 



([Hastie and Tibshirani . 



19931 ) can be adopted here. We discretize each predictor into, say, 10 bins and 



fit a logistic regression. The first bin of each predictor is used as the reference level and effect coding is 
applied to each discretized covariate. The circles (o) in Figure Q] denote the estimated coefficients for each 
bin of each predictor and hint at some interesting nonlinear effects. For instance, the chance of being an 
LBO target seems to monotonically decease with market-to-book ratio and be quadratic as a function of log 
market equity. Regularization can be utilized to borrow strength between neighboring bins and gear solution 
towards clearer patterns. To illustrate the flexibility of the regularization scheme ([!}, we apply cubic trend 
filtering to 5 covariates (cash flow, cash, long term investment, tax, return on S&P 500 index), impose the 
monotonicity (non-increasing) constraint on the 'market-to-book ratio' covariate, and enforce the concavity 
constraint on the 'log market equity' covariate. This can be achieved by minimizing a regularized negative 
logistic log-likelihood of form 

-i((3 1 ,...,f3 7 ) + P £ wv^u + pY, WWjPiU, 

35*4,5 3=4,5 

where /3 • is the vector of regression coefficients for the j-th discretized covariate. The matrices in the 
regularization terms are specified as 



( -1 2-1 
1 -4 6 



1-4 6 - 



1 

4 1 



for j = 1,2, 3, 6, 7, 



I -1 1 



W 4 



6 -4 1 

-12-1/ 

\ 



-1 1 



-1 1 



V 

(1 -2 
1 



, and 



1 1 / 



1 

-2 1 



1 -2 1 

1-21/ 
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The equality constraint regularization matrix V j, j = 1, 2, 3, 6, 7, penalizes the fourth order finite differences 
between the bin estimates. Thus, as p increases, the coefficient vectors of covariates 1-3,6-7 tend to be 
piecewise cubic with two ends being linear, mimicking the natural cubic sp line. This is one example of 



the polynomial trend filtering (jKim et al 



2009: 



Tibshirani and Taylor 



20111 ). Similar to semi-parametric 



regressions, regularizations in polynomial trend filtering 'let the data speak for themselves'. In contrast, the 
bandwidth selection in semi-parametric regressions is replaced by parameter tuning in regularizations. The 
number and locations of knots are automatically determined by tuning parameter which is chosen according 
to model selection criteria. In a similar fashion, the coefficient vector gradually becomes monotone for 
covariate 'market-to-book ratio' and concave for covariate 'log market equity'. In addition, with p large 
enough, we recover the corresponding constrained solution, which are shown by the crosses (+) on solid 
lines in Figure [Tj As noted above, our exact path algorithm delivers the whole solution path bridging from 
the unconstrained estimates (denoted by o) to the constrained estimates (denoted by +). For example, the 
dotted lines in Figure [Tj is a snapshot of the solution at p = 0.6539. Availability of the whole solution path 
renders model selection along the path easy. For instance the regularization parameter p can be chosen by 
minimizing the cross-validation error or other model selection criteria such as AIC, BIC, or C p . Figure [5] 
displays the solution path and the AIC and BIC along the path. It shows that both criteria favor the fully 
regularized solution, namely the constrained estimates. The whole solution path is obtained within seconds 
on a laptop using a Matlab implementation of EPSODE. 

The patterns revealed by the regularized estimates match some existing finance theories. For instance, 
a company with low cash flow is unlikely to be an LBO target because low cash flow is hard to meet the 
heavy debt burden associated with the LBO. On the other hand, company carrying a high cash flow is likely 
to possess a new technology. It is risky to acquire such firms because it is hard to predict their profitability. 
The tax reason is obvious from the regularized estimates. The more tax the company is paying, the more 
tax benefits from an LBO. Log of market equity is a measure of company size. Smaller companies are 
unpredictable in their profitability and extremely large companies are unlikely to be an LBO target because 
LBOs are typically financed with a large proportion of external debts. Interested readers are referred to 



([Shivdasani and Wang 



20091 ) and references therein for related theories on LBO. 
This illustrative example demonstrates the flexibility of our novel path algorithm. First, it can be applied 
to any convex loss function. In this example, the loss function is the negative log-likelihood of a logistic 
model. Second, it works for complicated regularizations like polynomial trend filtering (equality constraints), 
monotonicity constraint, and concavity constraint. More applications will be presented in Section [7] to 
illustrate the potential of EPSODE. 

The rest of the paper is organized as follows. Section [2] reviews the exact penalty method for optimiza- 
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Figure 1: Snapshots of the path solution to the regularized logistic regression on the M&A data set. 




G 



tion. Here the connections between constrained optimization and regularization in statistics are made clear. 
Section [3] derives in detail the EPSODE algorithm for strictly convex loss function /. Its implementation 
via the sweep operator and ordinary differential equations are described in Section 2) An extension of EP- 
SODE for / convex but not necessarily strictly convex is discussed in Section [SJ Section [5] concerns model 
selection along the path. Section [7] presents various applications of EPSODE. Finally, Section [5] discusses 
the limitations of the path algorithm and hints at future generalizations. 

2 Exact Penalty Method for Convex Constrained Optimization 

Consider the convex program 

min f{x) 

s.t. gi(x) = 0, 1 < i < r (3) 
h 3 (x) < 0,1 < j < s, 

where the objective function / is convex, equality constraint functions <?i are affine, and the inequality 
constraint functions hj are convex. We further assume that / and hj are smooth. Specifically we require 
that / and hj are continuously twice differentiablc. To fix notation, differential df (x) is the row vector of 
partial derivatives of / at x and the gradient V/(x) is the transpose of df(x). The Hessian matrix of /(•) 
is denoted by d 2 f(x). 

Exact penalty method minimizes the function 

r s 

E p (x) = f(x)+pJ2\9i(^)\+pJ2 max i^ h o( x )} W 



for p > 0. Classical results (|Ruszczvhski 



2006, Theorems 6.9 and 7.21) state that for p large enough, 
the solution to the optimization problem (|4]) coincides with the solution to the original constrained convex 
program ([3]). This justifies the ex act penalty method as one way to solve constrained optimization problems. 



According to convex calculus (jRuszczvhski 



20061 Theorem 3.5), the optimal point x(p) of the function 



£ P {x) is characterized by the necessary and sufficient condition 



= Vf(x) + p^2s i Vg i (x) + p^2t j Vh j (x) (5) 

i=i j=i 



with coefficients satisfying 



'{-1} 9 t(x)<0 ({0} h j (x)<0 

Si e{ [-1,1] gi(x)=0, and tj G < [0, 1] h 3 (x) = . (6) 

U} g t (x)>0 {{1} h j (x)>0 
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The sets defining possible values of Sj and tj are the sub-differentials of the functions |x| and x+ = max{x, 0}. 
For path following to make sense, we require uniqueness and continuity of the solution x(p) to ((3]) as p varies. 
The following lemma concerns the continuity of the solution path and is the foundation of our path algorithm. 

Lemma 2.1. 1. (Uniqueness) If £ p is strictly convex, then its minimizer x{p) is unique. 

2. (Continuity) If £ p is strictly convex and coercive over an open neighborhood of p, then the minimizer 
x(p) is continuous at p. 

3. (Continuity of s, andtj) Furthermore, if the gradients {Vgi(x) : gi{x) = 0} U {Vhj (as) : hj(x) = 0} of 
active constraints are linearly independent at the solution x(p) over an open neighborhood of p, then 
the coefficient paths Si(p) and tj{p) are unique and continuous at p. 



20061) ■ For continuity 



Proof. The uniqueness of minimum under strict convexity is well-known (jRuszczvnskil 
suppose that the solution x(p) is not continuous at p. Then there exists e > and a sequence p n — > p such 
that \\x(p n ) — cc(p)|| > e for all n. Since £ p is coercive, x(p n ) is bounded and there exists a subsequence 
of x(p n ) that converges to some point y. Taking limits in the inequality £ Pri [x{p n )\ < £ Pn (x) shows that 
£p{y) < £p( x ) f° r au x i i- c -i V = x (p)- This contradicts with \\y — x(p)\\ > e. Therefore x(p) is continuous 
at p. For the continuity of coefficients, under the linearly independence assumption, Si(p) and tj{p) can be 
uniquely solved by the stationarity condition ([5]) given solution vector x{p). Therefore continuity of Si{p) 
and tj(p) inherits from continuity of x(p). □ 

We remark that strict convexity only gives an easy-to-check sufficient condition for uniqueness and continuity; 
it is not a necessary condition. A convex but not strictly convex function can still have a unique minimum. 
The absolute value function \x\ is such an example. When the loss function / is strictly convex, then £ p is 
strictly convex for all p > and by Lemma l2. II there exists a unique, continuous solution path {x(p) : p > 0}. 
In Section [3] and 01 we derive the path algorithm assuming that / is strictly convex. When / is convex but 
not strictly convex, e.g., when n < p in the least squares problems, the solutions at smaller p may not be 
unique. In that case, it is still possible to obtain a solution path over the region of large p where the minimum 
of £ p is unique. In Section [SJ we extend EPSODE to the case / is convex but may not be strictly convex. 
The third statement of Lemma I2TT1 implies that the active constraints (gi(x) — or hj(x) — 0) with interior 
coefficients must stay active until the coefficients hit the end points of the permissible range, which in turn 
implies that the solution path is piecewise smooth. This allows us to develop a path following algorithm 
based on ODE. 
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3 The Path Following Algorithm 

In this article, we specialize to the case where the constraint functions gi and hj are affine, i.e., the gradient 
vectors Vgi{x) and Vhj{x) are constant. This leads to the regularized optimization problem formulated 
as ([T]) by defining gi and hj as constraint residuals gt(x) = v\x — dj and hj(x) = WjX — ej. In principle 
a similar path algorithm can be developed for the general convex program where the inequality constraint 
functions hj are relaxed to be convex. But that is beyond the scope of the current paper. In Sections [3] and 
2J we assume that the loss function / is strictly convex. This assumption is relaxed in Section [5J 

Our path following algorithm EPSODE works in a segment-by-segment fashion. Along the path we keep 
track of the following index sets determined by signs of constraint residuals 



Ae = {i ■ 9i{x) = v\x - di < 0}, 
Ze = {i ■ gi{x) = v\x -di = 0}, 
V E = {i ■ gi{x) = v\x -<k> 0}, 



Aj = {j : hj(x) = w^x - ej < 0} 



Zi = {j : hj(x) = w\x 



0} 



(7) 



V\ = {j '■ hj(x) = w)x - ej > 0}. 



Along each segment of the path, the set configuration is fixed. This is implied by the continuity of both the 
solution and coefficient paths established in Lemma HOI Throughout this article, we call the constraints in 
Ze or Z\ active and others inactive. 

Next we derive the ODE for the solution x(p) on a fixed segment. Suppose we are in the interior of a 
segment. Let x(p) be the solution of Q indexed by the penalty parameter p and x(p+ Ap) the solution when 
the penalty is increased by an infinitesimal amount Ap > 0. Then the difference Ax(p) = x(p + Ap) — x{p) 
should minimize the increase in optimal objective value. That is, to the second order, Ax is the solution to 

mhiAx £p+A P (x + Ax) - £ p {x) 



df(x) ■ Ax + -Ax* ■ d 2 f{x) ■ Ax 



ieAf B jeVi 



Ax 



(8) 



+Ap- 



ieMs iev E jeVi 



s.t. v\ ■ Ax = 0,i e Z E , 
w\ ■ Ax = 0, j G Zi. 



Note that the active constraints have to be kept active since the set configuration is fixed along this segment 
by Lemma 12.11 This is why we have these two sets of equality constraints. To ease notational burden, we 
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define 



H(x) = d 2 f(x) (9) 



i&N-E ieP E jePi 



This leads to the corresponding Lagrange multiplier problem 



H(x) U% \ ( Ax \ ( -Vf(x)-(p + A P )u 2 
U z )\ \ z J I 



where the rows of the matrix Uz are the constant differentials, v\, i g Zs, and u;*, j G Z\(x), of the active 
constraint functions. Denoting the inverse of matrix as 



H(x) U z \ 1 = f P(x) Q(x) 
U z J \ Q*(x) R(x) 



where 



P(x) = H^ix) - H- x (x)U z [UzH^ix^z] ^ UzH~\x) 

Q{x) = H- 1 {x)U z \^J z H- 1 {x)U t z Y 1 (10) 

R(x) = -[UzH-^x^z^Y 1 , 

the solution of the difference vector Ax is 

Ax = -P(x)[Vf(x) + {p + Ap)uz] 

= -P(x)[Vf(x) + P Uz(x) + Ap-Uz] 
= -P(x)[-pU t z rz + Ap-uz}. 

Note P{x)U z = 0. Therefore Aa; = — Ap ■ P{x)u2- This gives the direction for the infinitesimal update of 
solution vector x(p). Taking limit in Ap leads to the following key result for developing the path algorithm. 

Proposition 3.1. Within interior of a path segment with set configuration ([7J) ; the solution x(p) satisfies 
an ordinary differential equation ( ODE) 

^ - (ID 
where the matrix P(x) and vector v,g are defined by \10\) and (0). 

Note that the right hand side of (fTTj) is a constant vector in x when / is quadratic and gi and hj are affine . 



Thus the corresponding solution path is piecewise linear. This recovers the case studied in (jZhou and Lange 



20111 ). The differential equation (fTTj) holds on the current segment until one of two types of events happens: 



an inactive constraint becomes active or vice versa. The first type of event is easy to detect - whenever a 
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constraint function, gi(x), i 6 A/e U Pe, or hj(x), j € A/j U Pi, hits zero, we move that constraint to the 
active set -Ee or -2j and start solving a new system of differential equations. To detect when the second type 
of event happens, we need to keep track of the coefficients Si(x) and tj(x) for active constraints. Whenever 
the coefficient of an active constraint hits the boundary of its permissible range in ([B]), the constraint has 
to be relaxed from being active in next segment. It turns out the coefficients for active constraints admit a 
simple representation in terms of current solution vector. 

Proposition 3.2. On a path segment with set configuration (M), the coefficients Si and tj for active con- 
straints are 



-Vf(x)+u z 
P 



(12) 



where x = x(p) is the solution at p and the matrix Q(x) is defined by ilO\) . 
Proof. Stationarity condition ([5]) implies 

U z r z = --Vf(x) - U%r z = --Vf(x) - u z . 
P P 

Multiplying both sides by Q{x) gives (fT2|) . □ 



Given current solution vector x(p), the coefficients of the active constraints are readily obtained from (|12l) . 
Once a coefficient hits the end points, we move that constraint from the active set to the inactive set that 
matches the end point being hit. In next section, we detail the implementation of the path algorithm. 

4 Implementation: ODE and Sweeping Operator 

Algorithm 1 EPSODE: Solution path for regularization problem ([I} with strictly convex /. 
Initialize p = 0, /3(0) = argmin/(/3) and its set configuration ([7]). 
repeat 

Solve ODE ([TT]) until an inactive constraint becomes active or the coefficient (fT2]) of an active constraint 
hits boundary. 

Update the set configuration ([7]). 
until A/e = "Pe = Vi = 



Algorithm [T] summarizes EPSODE based on Propositions 13.11 and 13.21 It involves solving ODEs segment 
by segment and is extremely simple to implement using softwares with a reliable ODE solver such as the 
ode45 function in Matlab and the deSolve package (?) in R. There has been extensive research in applied 
mathematics on numerical methods for solving ODEs, notably the Runge-Kutta, Richardson extrapolation 
and predictor-cor r ector methods. Some path following algorithms developed for specific statistical problems 



( Park and Hastic 



2007 



Friedman . 



2008) turn out to be approximate methods for solving the corresponding 
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ODE. IWul (|2011h first explicitly uses ODE to derive an exact solution path for the lasso penalized GLM 
The connection of path following to ODE relieves statisticians from the burden of developing specific path 
algorithms for a variety of regularization problems. 

Any ODE solver repeatedly evaluates the derivative. Suppose the number of parameters is p. Computa- 
tion of the matrix-vector multiplications in (jlll) and (fT2")l has computation cost of order 0(p 2 ) + 0(p\Z\) + 
0(\Z\ 3 ) if the inverse H^ 1 of Hessian matrix of loss function / is readily available, where Z = Ze U Z\ and 
\Z\ denotes its cardinality. Otherwise the computation cost is 0(p 3 ) + 0(p\Z\) + 0(|Z| 3 ). 

An alternative implementation avoids repeated matrix inversions by solving an ODE for the matrices P, 

sweep and inverse 



sweep opera t ors of regression analysis ([Dempster 
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Dempster . 


1969 


Goodnight, 


1979 


Jennrich 


1977: 



Lange . 



Little and Rubinl . 



2010). Suppose A is an m x m symmetric matrix. Sweeping on the fcth diagonal entry cikk 7^ 
of A yields a new symmetric matrix A with entries 

1 



a kk 

aik 
akj 



i ^ k 



auk 
aik 

akk ' 
akk 



a 



a%kakj . 
akk 



These arithmetic operations can be undone by inverse sweeping on the same diagonal entry. Inverse sweeping 
on the kth diagonal entry sends the symmetric matrix A into the symmetric matrix A with entries 

1 



akk 
a>ik 
akj 



akk 
aik 

akk ' 

a kj 

akk 



13 



i £ k 

a%kakj . •11 
akk 



Both sweeping and inverse sweeping preserve symmetry. Thus, all operations can be carried out on either 
the lower or upper triangle of A alone, saving both computational time and storage. When several sweeps 
or inverse sweeps are performed, their order is irrelevant. 

At beginning (p = 0) of the path following, we initialize a sweeping tableau as 



H~\x) 


H- 1 {x)U t 




UH- L {x)U l 



where the matrix U £ K( r + s ) x P holds all constraint differentials v\ and id* in rows. Further sweeping of 
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diagonal entries corresponding to the active constraints yields 





Q(x) 


P{x)U% 




R(x) 


Q\x)U% 


* 


* 


UzP(x)U% , 



(13) 

Here we conveniently organized the columns of the swept active constraints before those of un-swept ones. In 
practice the sweep tableau is not necessary as in (I13[) and it is enough to keep an indicator vector recording 
which columns are swept. The key elements for the path algorithm magically appear in the sweep tableau 

(USD 

dx(p) 



dp 
rz(p) 



-P{x)U%r z 



1 



-Q t {x)U%r z --Q\x)Vf{x). 



Therefore path following procedure only involves solving ODE for the whole sweep tableau (1131) with sweeping 
or inverse sweeping at kinks between successive segm ents. For this purpose we deriv e the ODE for the sweep 



tableau ([13]) . We adopt the convenient notations in ([Magnus and Neudeckei 
F(X) : R nx i R mx P, 



1999). For a matrix function 



DF{X) 



dvecF(X) 



d(vecXy 

denotes the mp x nq Jacobian matrix. For example, Proposition 13. II states Dx(p) = —P(x)u z . 

Proposition 4.1 (ODE for Sweep Tableau). On a segment of path with fixed set configuration, the matrices 
P{p), Q(p) and R(p) satisfy the ordinary differential equations (ODE) 

DP(p) = [P(x) <g> P(x)} ■ [DH{x)] ■ P{x)u 2 
DQ(p) = [Q\x) ® P(x)} ■ [DH{x)\ ■ P{x)uz 
DR{p) = [Q\x) ® Q\x)] ■ [DH{x)\ ■ P{x)u £ . 

Proof. First consider 

R(x) = - [UzH^WU^Y 1 = -M _1 (x). 



By chain rule ([Magnus and Neudecker 



19991 p91), 



DR(p) = DR(M) ■ DM(H) ■ DH{x) ■ Dx(p) 

= [R(x) ® R(x)} ■ DM(H) ■ DH{x) ■ Dx(p) 

= ~[R(x) ® R(x)] ■ {[U z H-\x) ® UzH~ l (x)] ■ DH(x)} ■ Dx(p) 

= [Q*(x) ® Q t {x)] ■ [DH(x)] ■ P{x)uz- 
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Similar calculations yield formula for Q(x) = —H 1 (x)U t z R(x) and P(x) — H 1 (x) — Q(x)U zH l {x). 

□ 

Solving ODE for these matrices requires the p 2 -hy-p Jacobian matrix of the Hessian matrix H(x) = 



which we provide for each example in Section [7] for convenience. When the number of parameter p is large, 
DH is a large matrix. However there is no need to compute and store DH and we are only required to 
compute the matrix vector multiplication DH ■ v for any vector v. In light of the useful identity (B l ® 
A)vec(C) = vcc(ACB), evaluating the derivative for the whole tableau only involves multiplying three 
matrices and incurs computational cost 0(p 3 ) + 0(p 2 \Z\) + 0(p\Z\ 2 ). 

Although we have presented the path algorithm as moving from p — to large p, it can be applied 
in either direction. Lasso and fused-lasso usually start from the constrained solution, while in presence of 
general equality constraints, e.g., polynomial trend filtering, and/or inequality constraints, the constrained 
solution is not readily available and the path algorithm must be initiated at p = 0. 



So far we have assumed strictly convexity of the loss function /. This unfortunately excludes many interesting 
applications, especially p > n case of the regression problems. In this section we briefly indicate an extension 
of EPSODE to the case / is convex but not necessarily strictly convex. In the proof of Proposition [371] the 
infinitesimal change of solution Aa; is derived via minimizing the equality-constrained quadratic program 
([5]), the solution to which requires inverse of Hessian if _1 and thus strict convexity of /. Alternatively we 
may solve © via reparameterization. Let U z hold the active constraint vectors and Y £ Rp x (jH z I) be a 
null space matrix of Uz, i-e., the columns of Y are orthogonal to the rows of U z- Then the infinitesimal 
change can be represented as Aa; = YAy for some vector Ay £ W. p ~\ z \. Under this reparameterization, the 
quadratic program © is equivalent to 



fix), 



DH{x) 



<9[vecff (a;)] dvcc[df 2 (x) 
dvec(xY dvec(xy 



5 Extension of EPSODE 




with explicit solution 



Ay 



[Y t H{x)Y]- 1 Y t [Vf{x) + (p + Ap)u 2 ]. 
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Hence the infinitesimal change in x{p) is 

Ace = -Y[Y t H{x)Y]- 1 Y t [Vf{x) + (p + Ap)wj] 
= -Ap • Y[Y t H(x)Y]- 1 Y t u£\. 

Again taking limit gives the following result in parallel to Proposition l3.il 

Proposition 5.1. Within interior of a path segment with set configuration TO, the solution x(p) satisfie 
an ordinary differential equation ( ODE) 

dx (p) vrvtir^vi-ivi 



dp 

where Y is a null space matrix of U z ■ 



Y[Y t H{x)Y}- 1 Y t Uz (14) 



An advantage of (|14p is that only non-singularity of the matrix Y t H(x)Y is required which is much weaker 
than the non-singularity of H . The computational cost of calculating the derivative in (|T4")) is 0((p— \Z\) 3 ) + 
0{p{p — \Z\)), which is more efficient than (|11[) when p — \Z\ is small. However it requires the null space 
matrix Y, which is nonunique and may be expensive to compute. Fortunately the null space matrix Y 
is constant over each path segment and in practice can be calculated by QR decomposition of the active 
constraint matrix U z- At each kink either one constraint leaves Z or one enters Z. Therefore Y can be 



sequentially updated (jLawson and Hanson 



1987) and need not to be calculated anew for each segment. 
Which version of (fTTj) and (fT4"|) to use depends on specific application. When the loss function / is not 
strictly convex, e.g., p > n case in regression analysis, only (|14[) applies. Interested readers are referred to 



(jNocedal and Wrightl . 



2006) for a similar dilemma in optimization methods. 



6 Model Selection Along the Path 

In applications such as penalized GLMs, the tuning parameter p in the regularization problem ([1} is chosen 
by a model selection criterion such as AIC, BIC, C p , or cross-validation. The cross validation errors can be 
readily computed using the solution path output by EPSODE. Yet the AIC, BIC, and C p criteria require an 
estimate of the degrees of freedom of estimate /3(p). Specifically AIC and BIC are defined by 

AIC = -£(/3(p)) + df(/3(p)) 
BIC = -m P )) + ^df(/3(p)), 

where —(■{•) denotes the negative log-likelihood and df(/3 p ) is the degrees of freedom for estimate (3 p . We 
propose to use 

df(/3(p))=p-|Z E UZi| (15) 



15 



as a measure of the degrees of freedom under GLMs. It is p reviously shown that (fT5|) is an unbia sed estimate 



of the degrees of freedom for lasso penalized least squares ([Efron et al 



lasso penaliz ed least squares (Tibshi rani and Taylor 



problem (fTJ) ([Zhou and Lange , 



Efron et al.. 


2004; 


Zou et al.. 


2007) 



20111) . and the least squares version of the regularized 



20111 ). Using the same degrees of freedom formula (fT5|) for GLMs is just ified 



by the local approximation of GLM loglikelihood by weighted least squares. See (jPark and Hastie 
details. 



20071) for 



7 Applications 

In this section, we collect some representative regularized or constrained estimation problems and demon- 
strate how they can be solved by path following. For all applications, we list the first three derivatives of the 
loss function / in (TT]). In fact, the third derivative is only needed when implementing by solving the ODE 
for the sweep tableau. 

7.1 GLMs and Quasi-Likelihoods with Generalized l\ Regularizations 

The generalized linear model (GLM) deals with exponential families in which the sufficient statistics is Y 
and the conditional mean /i of Y completely determines its distribution. Conditional on the covariate vector 
x eW, the response variable y is modeled as 

'y(x,0)-ft>((x,0)) 



p(y\x;/3,a) cx exp 



c(a) 



(16) 



where the scalar a > is a fixed and known scale parameter and the vector (3 is the parameters to be 
estimated. The function tp : K. h-> E is the link function. When i/el, tp(u) = u 2 /2 and c(er) = a 2 , (Ti"6|) 
is the normal regression model. When y G {0, 1}, tf)(u) — ln(l + exp(u)) and c(cr) = 1, (|16p is the logistic 
regression model. When y G N, iJj(u) = cxp(u), and c(cr) = 1, (| 16[) is the Poisson regression model. 

The quasi-likelihoods generalize GLM without assuming a specific distribution form of Y. Instead only a 
function relation between the conditional means jtij and variances a 2 , a 2 = V[}ii) for some variance function 
V, is needed. Then the integral 

r» y-t 



o 2 V(t) 



dt 



behaves like a log-likelihood function under mild conditions and is called the quasi-likelihood. The quasi- 
likelihood includes GLMs as special cases with appropri ately chosen variance function V(-). Readers are re- 



ferred to the classical text (jMcCullagh and Nelder 



19831 Table 9.1) for the commonly used quasi-likclihoods. 



By slightly abusing our notation, we assume a known link function between the conditional mean ^ and 
linear predictor xf(3, /x = fi(xf f3) and denote Qi((3) — Q(n(xf f3),yi). Then the quasi-likelihood with 
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generalized l\ regularization takes the form 



-Q(J3) + P \\V(3 - - - Eti Qi(P) + P\\VP - d\\u 



(17) 



which is a special case of the general form ([T]). Specific choices of the regularization matrix V and constant 
vector d lead to lasso, fused-lasso, trend filtering, and many other applications. 

For the path algorithm, we require the first two or three derivatives of the complete quasi-likelihood. 
Denoting rj = Xf3 with X = (x\, x\, • • • , a^)*, we have 



VQG9) = [£)/i(r,)] t y- 1 (y-/x)/a 2 =X t [ J D / i(r,)]V- 1 (y-/i)/<r 2 ) 
H(3) = d 2 Q(3) = [{y-n) t V- 1 ®X t ]-D 1 ^{r 1 )-X/a 2 , 
DH{3) = d a Q(j3) = [X* ® (y — p) t y _1 ® X*] • D 3 fi(rj) ■ X/a 2 , 



(18) 



where V is a n-by-n diagonal matrix with diagonal entries V{[i{x\y)), Dfi(rj) is a n-by-n diagonal matrix 
with diagonal entries //(x*/3), D 2 fi(r]) is a n 2 -by-n matrix with (n(i — 1) + i, i) entry equal to ii"(x\3) for 
i = 1, . . . , n and otherwise, and D 3 [i(r]) is a n 3 -by-n matrix with (n 2 (i — 1) + n(i — 1) + i, i) entry equal 
to n'"(X j 3) fo r i = 1, . . . ,n and otherwise. Note for GLM with canonical link, these formulas simplify 



(jAgresti 



20021 Section 4.6.4). 

The most widely used l\ regularization is the lasso penalty which imposes sparsity on the regression 
coefficients. For numerical demonstration, we revisit the M&A example introduced in Section [T] without 
discretizing each predictor. We standardize each predictor first and consider the lasso penalized linear 
logistic regression model. Figure [3] shows the lasso solution path for each standardized predictor in the left 
panel and corresponding AIC and BIC scores in the right panel. The order at which predictors enter the 
model matches the more detailed patterns revealed by the varying coefficient model in Figure Q] The almost 
monotone effects of the predictors 'market-to-book ratio', 'cash flow', 'cash', and 'tax' can be captured by 
the usual linear logistic regression and these covariates are picked up by lasso first. The nonlinear effects 
shown in the other predictors are likely to be missed by the linear logistic regression. For instance, the 
quadratic effects of 'log market equity' shown in the regularized estimates in Figure [1] are missed by both 
AIC and BIC criteria. 



All the generalized lasso problems studied in (jTibshirani and Tavlor 



20 111 ) for Gaussian linear regression 



naturally generalize to GLMs or quasi-likelihoods and are subject to the EPSODE path algorithm. This 
leads to applications to lasso or fused-lasso penalized GLMs, outlier detections, trend filtering, and image 
restoration for GLMs. For instance, cubic trend filtering is performed on five predictors o f the M fcA example 



in Section 1. The graph- guided penalized linear regression proposed in ([Chen et al 



20101 ) can also be 



generalized to GLMs or quasi-likelihoods. Suppose each node % of a graph is assigned a regression coefficient 
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Figure 3: M&A example revisited. Lasso solution path on the seven standardized predictors. 
Pi. In graph penalized regression, the objective function takes the form 



£(/3) + A G ^ 



ft ( s ft 

-=-sgn(ry)— = 
s/di 



A L £|ft 



(19) 



where the set of neighboring pairs i ~ j define the graph, di is the degree of node i, and ry is the correlation 
coefficient between i and j. This is simply a special case of (fl7|) when the ratio Ag/Al is fixed. 

7.2 Shape-Restricted Regressions 



Orde r-constrained regression has been an important modeling tool (jRobertson et al 



1988; 



Silvapulle and Sen 



2005). If /3 denotes the parameter vector, monotone regression imposes isotone constraints /3± < Pi < • • • < 
P p or antitone constraints Pi > P% > ■ • • > In partially ordered regression, subsets of the parameters 
are subject to isotone or antitone constraints. In some other problems it is sensible to impose convex or 
concave constraints. Note that if locations of regression parameters are at irregularly spaced time points 
ti < t2 < ■ ■ ■ < t pi convexity translates into the constraints 

Pi+2 — Pi+1 Pi+l — Pi 
U+2 — ti+1 ti+1 — ti 

for 1 < i < p — 2. When the time intervals are uniform, the constraints simplify to /3 i+2 ~ fti+i ^ fti+i ~ Pi, 
i = 1,2, • • ■ ,p — 1. Concavity translates into the opposite set of inequalities. 

Most of previous work has focused on the linear regression problems because of the computat ional and 



20iri 



proposes 



theoretical complexities in the generalized linear model setting. The recent work (jRufibach . 
an active set algorithm for GLMs with order constraints. The EPSODE algorithm conveniently provides 
a solution to the linearly constrained estimation problem @ . The relevant derivatives of loss function are 
listed in (Tr8|) . It is noteworthy that EPSODE not only provides the constrained estimate but also the whole 
path bridging the unconstrained estimate to the constrained solution. Availability of the whole solution path 
renders model selection between the two extremes simple. 
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In the illustrative M&A example of Section [TJ the bin predictors for the 'market-to-book ratio' are 
regularized by the antitone constraint and those for the 'log market equity' covariate by the concavity 
constraint. 

7.3 Gaussian Graphical Models 



In recent years several authors (jFriedman et al 



2008; 



Yuan 



2008T ) proposed to estimate the sparse undirected 



graphical model by using lasso regularizations to the log-likelihood function of the precision matrix, the 
inverse of the variance-covariance matrix. Given an observed variance-covariance matrix £ € R pxp , the 



negative log-likelihood of the precision matrix f2 = X under normal assumption is 



f(Sl) = - log det n + tr(Efi) 



(20) 



with the MLE solution S when S is non-degenerate. A zero in the precision matrix implies conditional 
independence of the corresponding nodes. Graphical lasso proposes to solve 



/(") + ?£ I 



I, 



(21) 



where p > is the tuning constant and u>g denotes the (i , 7')-ele ment of $7. It is well-known that the 



determinant function is log-concave (M agnus and N cudcckcr 



1999) 



Friedman et al 



The refore the loss function / l|20p is 



(2008) proposed an efficient coordinate 



convex and the EPSODE algorithm applies to (|21[) . 

descent proced u re for solving (|21[) at a fixed p. A recent attempt to approximate the whole solution path is 
made by 



Yuan! (|2008f ). Again his path algorithm can be deemed as a primitive predictor-corrector method 
for approximating the ODE solution. 

With symmetry in mind, we parameterize $7 in terms of its lower triangular part by a p(p + l)/2 column 
vector x and let Dfl(x) = ^ccx)* be the corresponding p 2 -by-p(p+ 1)/2 Jacobian matrix. Note DQ.(x) ■ x = 
vec$7(a;) and each row of Dtt(x) has exactly one nonzero entry which equals unity. We list here the first 
three derivatives of /. The proof is straightforward using matrix calculus and omitted for brevity. 

Lemma 7.1. 1. The derivatives for the Gaussian graphical model i20\) with respect to Jl are 

Df(il) = df(n) = [i;ec(-fr 1 + £)]* 

D 2 f(n) = d 2 f(n) = n 1 ® rr 1 

£> 3 /(fi) = -{I n ®K nn ®I n ) 

■ [rr 1 ® rr 1 ® ve^rr 1 ) + w^rr 1 ) ® n~ x ® jr 1 ], 



where K nn is the commutation matrix ItMaanus and Neudecken . \1999l ). 
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2. The derivatives for the Gaussian graphical model i2U\) with respect to x 



Df( x ) = Df(n) ■ Dn(x) 

H(x) = D 2 f(x) = [Dn(a:)]* • D 2 f(n) ■ Dft(x) 
DH(x) = D 3 f(x) = {[Dtlix)} 1 ® [Dn(x)f} ■ D 3 f{n) ■ Dft(x). 

When the covariance matrix X is nonsingular, EPSODE can be initiated either at p = or p = oo. When 
X is singular, we start from p = oo and the extended version of EPSODE Q14p should be used. If starting 
at p = 0, the solution is initialized at X ; If starting at p = oo, the solution is initialized at diag(a^ 1 ). 
Minimization of both the unpenalized and penalized objective function has to be performed over the convex 
cone of symmetric, positive semidefinite matrices, which is not explicitly incorporated in our path following 
algorithm. The next result ensures the positive dcfinitcncss of the path solution. 

Lemma 7.2 (Positive defmiteness along the path). The path solution Sl(p) minimizes \21)) over the convex 
cone of symmetric, positive semidefinite matrices. 

Proof. Both symmetry and the stationarity condition ([3]) are preserved by the path following. Therefore 
the path solution Sl(p) constitutes a minimum of the penalized objective function (1211) . This implies that 
det(17(/o)) > otherwise f(Cl(p)) — oo, contradicting with the optimality of □ 

We illustrate the path algorithm by the classic al example of 88 stud ents' scores on five math courses - 



1973. Table 1.2.1). Figure H displays the 



mechanics, vector, algebra, analysis, and statistics (|Mardia et al. 
solution path from EPSODE. The top three edges chose n by lasso are analysis-algebra, statistics-algebra, 



and algebra- vector, matching the findings in (|Yuan 



2008). 



7.4 Nonparametric Density Estimation 

As part of the trend of statistics shifting from parametric models to semi- or non-parametric models, nonpara- 
metric density estimation has attracted much attention in recent years. The maximum likelihood estimation 
for nonparametric density estimation often involves a nontrivial, high-dimensional constrained optimization 
problem. In this section, we briefly demonstrate the applicability of EPSODE to the maximum likeli- 
hood estimation o f univa riate log-concave density. Extensions t o multivariate log-concay e densi ty estimation 



(Cule et al 



2009 



20101 ) and fc-monotone density estimation (jBalabdaoui and Wellner 



20071) will be pur- 



sued elsewhere. Some algorithms have been s pecifically crafted for log-concave density est i matio n, e.g., the 



iterative convex minorant algorit hm (ICMA) ( Groeneboom and Wellner 



recently an active-set algorithm (jDuembgen et al 



1992 



Jongbloed . 



1998) and more 



2007). It is noteworthy that, besides providing an al- 



ternative solver for log-concave density estimation, EPSODE offers the whole solution path between the 
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0.2 




Figure 4: Solution path of the 10 edges in lasso-regularized Gaussian graphical model for the math score 
data. The top three edges chosen by lasso are labeled. 



unconstrained and constrained solutions. For example, an "almost" log-concave density estimate in the 
middle of the path can be chosen that minimizes cross-validation or prediction error. This adds another 
dimension to the flexibility of nonparametric modeling. 

The family of log-concave densities is an attractive modeling tool. It includes most of the commonly 
used parametric distributions as special cases. Examples include n ormal, gamma w ith shape parameter > 1, 



20091 ) gives a recent review. A 



and beta densities with both parameters > 1. The survey paper (jWaltherl . 
probability density g(-) on K is log-concave if its logarithm 4>(x) = lng(x) is concave. Given iid observations, 
from an unknown distribution of densi ty g(-), with su pport at points x\ < ... < x n with corresponding 



frequencies pi, . . . ,p n , it is well-known (jWalther 



20021 ) that the nonparametric MLE of g exists, is unique 



and takes the form g = exp(0) where 4> is continuous and piecewise linear on [xi, x„], with the set of knots 
contained in {xi, . . . , x„}, and <fi — — oo outside the interval [xi, x n ]. This implies that the MLE is obtained 
by minimizing the strictly convex function 



n n — 1 „i 

y;^ + y> fe+1 -x fe ) / e (i-t)4> k +t4> k+ i dL 
i=i fe=i Jo 



over 4> = (4>i,4>2, • 



(>„)* € W 1 subject to constraints 



< 



2,...,n-l. 



The consistency of the MLE is proved by 



Pal et al 



MLE studied in (Balabdaoui et al 



2009) 



(|2007l ) and the pointwise asymptotic distribution of the 
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Following iDuembgen et al.l (|2007l) . we use notations 



= S n = 0, Si = x i+ i - x 

J{r,s) 



i = 1, . . . , n — 1 

eS - e " r ^ s 



e (l-t)r+*- dt= J ,- r 
i, \e r r = s 



n-l 



Then the objective function becomes 

n 

/(0) = - ^Pi^i + ^hJ^k^k+i)- 
i=i k=i 

The path algorithm requires up to the third derivative of the objective function / 



[V/(0)]i = -ft + £i_iJ, 

= [d 2 /(0)] y - 

^-1^02(^-1,^1 





5[ff(</>)] M _i 



0lW-l,^t) + SiJioi&i, <Pi+l 








<5»-i Jia(0i-i,0i 

8i-lJ03((f>i-l,(f>i 
8iJ2l{<t>i,<l>i+l) 





j = i-l 

+ Si J2o(<j>i, 0i+l ) j = i 

j = * + 1 

otherwise 

fc = i-l 
fc = i 
otherwise 

fc = i- 1 

+ ^'^3o(0i,^i+l) = « 

fc = i + 1 
otherwise 



0[if(0)] 



8iJn{4>i,<t>i+i) k = i 
SiJi2(<fii,<f>i+i) k = i + 1 . 
^ otherwise 

Interchanging the derivative and integral operators, justified by the dominated convergence theorem, gives 
a useful representation for the partial derivatives of J 



Jab(r,s) 



aa+b pi 

J(r, s) = / (1 - t) a t b e^ l - t > +ts dt. 



dr a ds b 

We derive a recurrence relation for J a b{f, s) to facilitate its computation 



Lemma 7.3. J a b(r,s) satisfy following recurrence 
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1. For r 7^ s, 



Joo(r,s 
Jw(r, s 
J 01 (r,s 
Jn(r,s 

Jab{r, s 
Jab(r, s 



s — r 
e r 



+ 



s — r ' (s — r) 2 
e s e s — e r 



s — r (s — r) 2 

e s + e r 2(e s - e r ) 

(s — r) 2 (s — r) 3 

a + b + s-r a - 1 

- Ja-i,&(r, s) Ja-2,b{r, s) 



s — r 
a+b — s + r 



Ja !& _i(r, s) + J a . b -2(r, s) 

s — r 



2. For r = s, 



Jab(r,s) 



e r a\b\ 



■Ja-l.- 



a + b + 1 



Ja 



b-1- 



(a + 6+1)! a + b+1 

Proof. We recognize J a b(r,s) as the Kummer confluent hypergeometric function multiplied by a constant 

(&+%.) (s-r) k 



Jab(r, s) = e r B(a + l,b+l)J2 



^( a + 6 + 2 ) (fc) 



= e r B(a + l,b+l) 1 F 1 (b + l,a + b + 2 \ s - r) 
= e s B(a + l,b+l) 1 F 1 (a + l,a + b + 2\ r - s) 

(a + b + 1)! 

Then the results follow from the well-known recurrence relation for Kummer hypergeometric function 

T-i / | v (1 - y){y + z - 2) (l-y)(2-y) 
i-Fi(a:,2/ I = — ? — r iFi(a;,y- 1 | z) + — — - \Fi(x,y - 2 \ z). 



(x~y + l)z 
and symmetry J ab (r, s) = J ba (s, r). 



{x-y+ l)z 



□ 



To illustrate the path algorithm for this problem, we simulate n — 25 points from the extremal distribution 
Gumbel(0,l)- Figure [5] displays the constrained and unconstrained estimates of 4>i a- n d the solution path 
bridging the two. 

8 Conclusions 

In this article we propose a generic path following algorithm EPSODE that works for any regularization 
problems of form (jTJ). The advantages are its simplicity and generality. Path following only involves solving 
ODEs segment by segment and is simple to implement using popular softwares such as R and Matlab. Besides 
providing the whole regularization path, it also gives a solver for linearly constrained optimization problems 
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X 



Figure 5: Log-concave density estimation, n = 25 points are generated from Gumbel(0,l) distribution. Top 
left: Unconstrained and concavity-constrained estimates <fi. Top right: Solution path. Bottom left: Empirical 
cdf and the cdf of MLE density. 
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that frequently arise in statistics. Our applications to shape-restricted regressions and nonparametric density 
estimation are special cases in particular. 

At least two extensions deserve further study. Current algorithm requires sufficient smoothness (twice 
differentiable) in the loss function. This precludes certain applications with non-smooth objective function, 
e.g., the Huber loss in robust estimation and the loss function in quantile regression. Generalization of 
our path algorithm to regularization of these loss functions requires further research. Another restriction 
in our formulation is the linearity in the regularization terms. In sp arse regressions, several authors have 



proposed nonlinear a nd non-conv e x pen alties. The bridge regression ([Frank and F riedman 



and SCAD penalties (|Fan and Li 



Friedman 



1993; 



Fu 



1998) 



2008), when the 



2001) fall into this category. As observed in 
penalty is not convex, the solution path may not be continuous and poses difficulty in path following, which 
strongly depends on the continuity and smoothness of the solution path. Fortunately, in these problems, the 
discontinuities only occur when new variables enter or leave the model. A promising strategy is to initialize 
the starting point of next segment by solving an equality constrained optimization problem. This again 
invites further investigation. 
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